Unsupervised Formation Matching in Highly Inflected Languages
Abstract
There have been multiple attempts to resolve inflection matching problems in information retrieval, and stemming is a common approach to this end. Among the many stemming techniques, statistical stemming has been shown to be effective in a number of languages, particularly highly inflected ones. Common statistical techniques rely heavily on string similarity in terms of prefix and suffix matching. However, infixes are common in irregular and informal inflections in morphologically complex texts, so a stemmer also needs to find infixes. In this paper we propose a method for finding affixes in different positions of a word. The method aims to find statistical inflectional rules based on the minimum edit distance table of word pairs and the likelihoods of the rules in a language. These rules are used to statistically stem words and can be used in different text mining tasks. Experimental results on the CLEF 2008 and CLEF 2009 English-Persian CLIR tasks indicate that the proposed method significantly outperforms all the baselines in terms of MAP.
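To make the idea of reading edits off a minimum edit distance table concrete, the following is a minimal sketch and not the authors' implementation. It fills a standard Levenshtein table for one word pair and backtraces it to list the character edits, which may fall anywhere in the word (prefix, infix, or suffix). The function names `edit_distance_table` and `extract_edits` and the tuple format are assumptions made for illustration; the abstract does not specify how rules are represented or how their likelihoods are estimated.

```python
def edit_distance_table(source: str, target: str) -> list[list[int]]:
    """Fill the standard Levenshtein dynamic-programming table."""
    rows, cols = len(source) + 1, len(target) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i  # delete all i characters of source
    for j in range(cols):
        table[0][j] = j  # insert all j characters of target
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,         # deletion
                table[i][j - 1] + 1,         # insertion
                table[i - 1][j - 1] + cost,  # match or substitution
            )
    return table


def extract_edits(source: str, target: str) -> list[tuple[str, int, str]]:
    """Backtrace the table and return the non-matching operations.

    Each edit is (operation, position in source, characters). This tuple
    format is a hypothetical choice, not the paper's rule format.
    """
    table = edit_distance_table(source, target)
    i, j = len(source), len(target)
    edits = []
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and source[i - 1] == target[j - 1]
                and table[i][j] == table[i - 1][j - 1]):
            i, j = i - 1, j - 1  # characters match, no edit
        elif i > 0 and j > 0 and table[i][j] == table[i - 1][j - 1] + 1:
            edits.append(("substitute", i - 1, source[i - 1] + ">" + target[j - 1]))
            i, j = i - 1, j - 1
        elif j > 0 and table[i][j] == table[i][j - 1] + 1:
            edits.append(("insert", i, target[j - 1]))
            j -= 1
        else:
            edits.append(("delete", i - 1, source[i - 1]))
            i -= 1
    edits.reverse()
    return edits


if __name__ == "__main__":
    print(extract_edits("sing", "sang"))    # an infix change
    print(extract_edits("walk", "walked"))  # a suffix change
```

Counting how often the same edit pattern recurs across many word pairs in a corpus would be one plausible way to estimate how likely a rule is, but that step is an assumption here and is not shown.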
Similar resources
Enhancing Morphological Alignment for Translating Highly Inflected Languages
We propose an unsupervised approach utilizing only raw corpora to enhance morphological alignment involving highly inflected languages. Our method focuses on closed-class morphemes, modeling their influence on nearby words. Our language-independent model recovers important links missing in the IBM Model 4 alignment and demonstrates improved end-to-end translations for English-Finnish and English-...
A Lemma Based Evaluator for Semitic Language Text Summarization Systems
Matching texts in highly inflected languages such as Arabic by a simple stemming strategy is unlikely to perform well. In this paper, we present an automatic text matching technique for inflectional languages, using Arabic as the test case. The system is an extension of the ROUGE test in which texts are matched at the token's lemma level. The experimental results show an enhancement of de...
Jezikovno neodvisno modeliranje pregibnega jezika
This article concerns statistical language modelling of the Slovenian language for automatic speech recognition. We investigate various techniques for overcoming the difficulties in modelling highly inflected languages. Slavic languages are particularly challenging, and Slovenian is one of them. Two main problems arise when modelling Slovenian in comparison to English. Th...
Morpheme Segmentation for Kannada Standing on the Shoulder of Giants
This paper studies the applicability of a set of state-of-the-art unsupervised morphological segmentation algorithms for the problem of morpheme boundary detection in Kannada, a resource-poor language with highly inflectional and agglutinative morphology. The choice of the algorithms for the experiment is based in part on their performance with highly inflected languages such as Finnish and Ben...
Rich morpho-syntactic descriptors for factored machine translation with highly inflected languages as target
The baseline phrase-based translation approach has limited success on translating between languages with very different syntax and morphology, especially when the translation direction is from a language with fixed word structure to a highly inflected language. There are two main points to improve on: morphological translation equivalence and long range reordering. Translating the correct surfa...
Journal title:
- CoRR
Volume: abs/1605.07852
Publication date: 2016